Predicting suitable habitats of Melia azedarach L. in China using data mining

Melia azedarach L. is an important economic tree widely distributed in tropical and subtropical regions of China and some other countries. However, it is unclear how the species’ suitable habitat will respond to future climate change. We aimed to select the most accurate of seven data mining models to predict the current and future suitable habitats for M. azedarach in China. These models were maximum entropy (MaxEnt), support vector machine (SVM), generalized linear model (GLM), random forest (RF), naive Bayesian model (NBM), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). A total of 906 M. azedarach locations were identified, and sixteen climate predictors were used for model building. The models’ validity was assessed using three measures: area under the curve (AUC), kappa, and overall accuracy (OA). We found that RF provided the best performance in prediction power and generalization capacity. The top climate factor affecting the species’ suitable habitats was mean coldest month temperature (MCMT), followed by the number of frost-free days (NFFD), degree-days above 18 °C (DD > 18), temperature difference between MWMT and MCMT, or continentality (TD), mean annual precipitation (MAP), and degree-days below 18 °C (DD < 18). We projected that the future suitable habitat of this species would increase under both the RCP4.5 and RCP8.5 scenarios for 2011–2040 (2020s), 2041–2070 (2050s), and 2071–2100 (2080s). Our findings are expected to assist in better understanding the impact of climate change on the species and provide a scientific basis for its planting and conservation.

MaxEnt models have been widely used in the fields of crop niches, plant diseases and insect pests, and species invasion prediction 16,17. Therefore, we compared these modeling approaches through data mining techniques to identify the most effective approach for predicting the suitable habitat of M. azedarach based on the relationship between its occurrence and climate variables.
Understanding the potential impact of climate change on the suitable habitat of M. azedarach is of great significance to its cultivation and conservation in China. Studies conducted on M. azedarach have mainly focused on tree and stand productivity, extraction of active ingredients, and pest resistance potential 3,18. Research on the potential distribution of M. azedarach as affected by climate change is lacking. The present study therefore explores the above-mentioned seven data mining techniques to establish climate-based distribution prediction models and to select the best model for predicting the species' future suitable habitat. Our specific objectives were to: (1) compare the prediction accuracy of the seven modeling algorithms and select the one with the best performance; (2) determine the key climatic factors related to suitable habitat; (3) develop current and future suitable habitat maps for M. azedarach in China and highlight the areas of change; and (4) assess the potential impact of future climate change on the species' suitable habitat.

Material and methods
Species location data. We used presence and absence data for M. azedarach in China to build the prediction models. First, we obtained 1,432 presence records from the Global Biodiversity Information Facility (GBIF; https://doi.org/10.15468/dl.3t8r62, accessed on 17 May 2022) and the Chinese Virtual Herbarium (CVH; http://www.cvh.ac.cn/, accessed on 17 May 2022). All the M. azedarach distribution data have been licensed. To avoid redundant sampling, we deleted sample points with similar longitude and latitude 19. We then thinned the records on a 0.01° mesh (0.01° corresponds to about 1 km on the ground), retaining only one distribution point in each grid cell so that the distance between sample points was more than 1 km 20. A total of 906 samples were included for model building. Finally, we used ArcMap 10.2 to overlay the .asc result file generated by the model with the map of China to generate the final result map (Fig. 1). In addition, all maps in our study were created in ArcMap 10.2.
Environment variables. We used M. azedarach presence-absence data as the dependent variable and 16 climatic factors derived from the ClimateAP_v221 software (http://ClimateAP.net) as predictors to build the model (Table S1) 21. We used the following tests to limit the effect of multicollinearity among the climate variables. First, the variance inflation factor (VIF) was calculated for each of the 16 variables (Table S2). Second, a correlation analysis was conducted for each pair of the 16 variables (Figure S1). Finally, we used stepwise regression analysis to eliminate the variables that led to the observed multicollinearity 22.
Model development and prediction. We used seven models (generalized linear model (GLM), gradient boosting machine (GBM), random forest (RF), support vector machine (SVM), maximum entropy (MaxEnt), extreme gradient boosting (XGBoost), and naive Bayesian model (NBM)) to associate the distribution of M. azedarach with climate variables. We used a data-driven approach to select the number of pseudo-absence points, testing 1000, 2000, 10,000, 30,000, and 100,000 pseudo-absence points. It was found that
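The 0.01° mesh thinning described under Species location data can be sketched as follows. This is a minimal illustration rather than the authors' actual procedure: the function name `thin_to_grid`, the rule of keeping the first record encountered in each cell, and the sample coordinates are all assumptions made for the example.

```python
import math

def thin_to_grid(points, cell_deg=0.01):
    """Keep one (lon, lat) record per cell of a regular cell_deg x cell_deg grid.

    Assumption: the first record encountered in a cell is the one retained;
    the source does not say which record was kept.
    """
    kept = {}
    for lon, lat in points:
        # Integer cell index; floor keeps negative coordinates in the right cell.
        cell = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        kept.setdefault(cell, (lon, lat))
    return list(kept.values())

# Hypothetical occurrence records (longitude, latitude), for illustration only.
records = [
    (113.2641, 23.1291),
    (113.2648, 23.1296),  # falls in the same 0.01 degree cell as the first record
    (113.2751, 23.1291),  # neighbouring cell to the east
    (102.7123, 25.0406),
]
thinned = thin_to_grid(records)
print(len(records), "records thinned to", len(thinned))
```

Note that one record per cell does not by itself guarantee a minimum spacing: two records on either side of a cell boundary can be closer than one cell width, so an explicit distance check would be needed if a strict minimum distance is required.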

Model validation.
To assess the performance of the seven predictive models, we compared their area under the receiver operating characteristic curve (AUC), Kappa, and overall accuracy (OA). The AUC is a probability value, rated according to the following criteria: 0.5-0.6 = fail, 0.6-0.7 = poor, 0.7-0.8 = fair, 0.8-0.9 = good, 0.9-1.0 = excellent 23. The Kappa coefficient is an index of classification accuracy. Kappa ranges from −1 to 1 but usually falls between 0 and 1, where it can be divided into five groups: 0.0-0.20 means very low consistency, 0.21-0.40 means general consistency, 0.41-0.60 means moderate consistency, 0.61-0.80 means high consistency, and 0.81-1.00 means almost perfect consistency 24. Both Kappa and AUC consider the true positive rate and true negative rate to avoid an overestimation or underestimation error (Sahin 2020).
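The three validation measures can be computed from presence/absence labels and predicted suitability scores as in the sketch below. It uses the standard definitions of overall accuracy, Cohen's kappa, and rank-based AUC; the 0.5 cut-off used to turn scores into presence/absence predictions, the toy data, and the choice to treat each lower bound of the AUC rating scale as inclusive are assumptions for illustration, not details taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary presence (1) / absence (0) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def overall_accuracy(tp, tn, fp, fn):
    """Share of all records that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def cohen_kappa(tp, tn, fp, fn):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n
    # Chance agreement from the row and column totals of the confusion matrix.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def auc(y_true, scores):
    """Probability that a random presence scores above a random absence (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def auc_rating(value):
    """Map an AUC value to the rating scale quoted in the text (lower bounds inclusive by assumption)."""
    for lower, label in [(0.9, "excellent"), (0.8, "good"), (0.7, "fair"), (0.6, "poor"), (0.5, "fail")]:
        if value >= lower:
            return label
    return "below the rated range"

# Toy data: four presences and four absences with hypothetical model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

counts = confusion_counts(y_true, y_pred)
oa = overall_accuracy(*counts)
kappa = cohen_kappa(*counts)
auc_value = auc(y_true, scores)
print(oa, kappa, auc_value, auc_rating(auc_value))
```

AUC is computed from the raw scores and so does not depend on the cut-off, whereas OA and kappa change with the threshold chosen to binarize the predictions.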
Habitat classification. Habitat suitability classes were determined from the predicted values as follows: 0-0.2, 0.2-0.4, 0.4-0.6, and > 0.6 were deemed unsuitable, low-, medium-, and high-suitability habitat, respectively 25. All methods were performed in accordance with the relevant guidelines and regulations.
(Table 1). Figure 4 displayed the relationships between the top six climate variables and M. azedarach suitability according to the predictions of the RF algorithm. The habitat suitable range was between −10 and −28 °C for MCMT  (Fig. 6f). Additionally, the species' stable range area showed the same change pattern as that of the expanded area (Fig. 6g).  (Fig. 6a-f). Furthermore, the species' area loss exhibited an opposite trend to that of the expansion and stable range areas (Fig. 6f), and most of the loss area was distributed in eastern coastal provinces near 30-38° N (e.g., Shandong (SD)) (Fig. 6).
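The suitability classes and the expansion, stable, and loss areas discussed above can be derived from gridded predictions roughly as follows. This is a sketch under stated assumptions: the top class is taken as values above 0.6 (the reading implied by the other three ranges), each class boundary is assigned to the lower class, a cell counts as suitable when its predicted value exceeds 0.2, and the cell values are invented for illustration.

```python
from collections import Counter

def suitability_class(value):
    """Assign a predicted value in [0, 1] to one of the four habitat classes."""
    if value > 0.6:
        return "high"
    if value > 0.4:
        return "medium"
    if value > 0.2:
        return "low"
    return "unsuitable"

def range_change(current, future, threshold=0.2):
    """Count cells by change type between two aligned lists of predicted values.

    Assumption: a cell is suitable when its value exceeds `threshold`.
    """
    counts = Counter()
    for now, later in zip(current, future):
        was_suitable = now > threshold
        will_be_suitable = later > threshold
        if was_suitable and will_be_suitable:
            counts["stable"] += 1
        elif will_be_suitable:
            counts["expansion"] += 1
        elif was_suitable:
            counts["loss"] += 1
        else:
            counts["unsuitable"] += 1
    return counts

# Hypothetical predicted suitability for five grid cells, now and in a future period.
current = [0.10, 0.35, 0.70, 0.15, 0.55]
future = [0.30, 0.45, 0.65, 0.10, 0.15]

classes_now = [suitability_class(v) for v in current]
change = range_change(current, future)
print(classes_now)
print(dict(change))
```

Multiplying each count by the area of a grid cell would give the expansion, stable, and loss areas, provided that cell area is treated as constant or corrected for latitude.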

Discussion
Model performance. We used the AUC, Kappa, and OA to evaluate the performance of seven species range prediction models (GLM, GBM, MaxEnt, SVM, XGBoost, NBM, and RF) in predicting M. azedarach contemporary and future ranges under two climate scenarios (RCP8.5 and RCP4.5). The results showed that RF and XGBoost were the top-performing models, with RF being the best, while NBM and GLM were the low-performing models, with NBM being the worst. Multiple lines of evidence support the superiority of the RF algorithm 26. In a study in northern California, GLM, ANN, RF, and ME models were used to predict new occurrences of rare plants, and RF provided the best prediction 27. Akpoti et al. used BRT, GLM, MaxEnt, and RF algorithms to predict rice production suitability and found that RF had better generalizability 28. Silva et al. found the highest model quality for the RF and GAM algorithms when assessing the limitations of different species distribution models using the Azorean forest as an example 29. The RF is an   30 . Additionally, the RF model is capable of avoiding the accuracy reduction caused by missing and noisy data in the training sample when predicting the relationship between a large number of predictor variables and the response variable 31, attributes that support the results of the present study. In contrast, although the NBM, like RF, is a machine learning algorithm and has been shown to be relatively insensitive to missing data, the algorithm is relatively simple 32. Studies have demonstrated that more complex species distribution models provide better predictive performance, which supports the suitability of the RF model for processing complex high-dimensional data such as those used in the present study 33. Moreover, the NBM is a linear classifier and, like traditional linear statistical methods, is insufficient for revealing the complex relationships among environmental variables 34.
In our case, the two linear models, NBM and GLM, demonstrated this with their poor predictive power. Additionally, we observed that the prediction accuracy of XGBoost was very close to that of RF, as XGBoost has good generalization performance 35. Although previous studies have shown that MaxEnt, SVM, and GBM models performed well in simulating species suitability distribution 36,37, our results showed that the prediction accuracy of these models was intermediate among the seven tested models. These results may indicate that species characteristics and sample size also influence the accuracy of species distribution models 38.
The importance of climate variables. Our study, along with several others 39-41, was based on the assumption that species distribution is mainly determined by climate 42,43. It is well documented that climatic factors are key elements for most species' population regeneration 44. Our results indicated that temperature-associated climate factors have a greater influence on M. azedarach suitable habitats than precipitation factors. Specifically, five of the top six contributing climatic variables were related to low temperature (MCMT, NFFD, and DD < 18) and continentality (TD), with MCMT contributing the most. This shows that low temperature was the main climatic factor restricting M. azedarach suitable habitat, which is consistent with previous studies, as low-temperature stress negatively affects plant physiological and biochemical responses (e.g., membrane system disorder, decline in photosynthetic rate, increase in harmful reactive oxygen species, and increase in osmotic adjustment substances) 45. The extension of the number of frost-free days (NFFD) was beneficial to increasing M. azedarach seed size and quality, thereby improving the survival rate 46
Range shift in response to climate change. Our study showed that M. azedarach would benefit from the anticipated climate change. More specifically, we found the RCP8.5 scenario to be more favorable for the expansion of the species' suitable habitat than the RCP4.5 scenario (Fig. 5g). The RCP8.5 scenario predicted a greater increase in future temperature and precipitation, providing climatic conditions favorable to the species' growth 46. In terms of geographic range change, the future suitable habitat is expected to expand northward and westward. Compared with the RCP4.5 scenario, the predicted change in suitable habitat under the RCP8.5 scenario was more pronounced in the plateau area near 40° N (Fig. 5), including the Xinjiang Tarim Basin (RCP8.5) (Fig. 5f). Under the RCP4.5 and RCP8.5 scenarios, the future temperature is projected to rise by 1.4-1.8 and 2.0-3.7 °C, respectively, making high-latitude areas warmer and raising the mountain tree line, which would ultimately give the species the potential for geographic range expansion 49. At the same time, we noted that the suitable habitat in the Shandong region would experience substantial range loss (Fig. 5), caused by a drastic change in climatic conditions from mainly dry continental airflow with little precipitation to a future warmer climate associated with an intensified reduction in precipitation 50. Additionally, the impact of the subtropical high cannot be overlooked, as Shandong is often affected by sinking air currents with long periods of high temperature and low precipitation. This subtropical high is expected to gradually move northward, followed by a clear northward shift in the precipitation pattern in Shandong 51. To a certain extent, the
Management strategies. Rapid climate change causes most tree populations to exist in unsuitable environmental conditions, threatening their growth and survival and even leading to population extinction 53. Some tree species adapt to new climatic conditions by migrating to the same environmental gradient or by evolving 54; however, other tree species would benefit from climate change 55. M. azedarach belongs to the species that would benefit from future climate change through range expansion. The wide distribution of M. azedarach harbours abundant phenotypic variation, and most of the species' phenotypic diversity is distributed in the southwest and south regions and, to a lesser extent, in other regions 56.
It is worth noting that if a widely distributed species cannot track the changing climate because of long-term local adaptation, it becomes more vulnerable 57. Therefore, to guard against this risk, we suggest taking proactive in-situ conservation measures in the Yunnan, Guizhou, Sichuan, Guangdong, and Guangxi regions, as they are rich in phenotypic diversity, which will help in coping with future environmental uncertainty 58. Assisted migration initiatives should target presently unsuitable habitats that are expected to become suitable in the future. For example, the northern regions of Jiangxi, Hubei, Anhui, Henan, and areas near 40° N are reasonable targets for assisted migration conservation measures 59. For areas that would be negatively affected by future climate, such as Shandong, we recommend ex-situ measures, namely establishing botanical gardens and seed banks in suitable habitats to protect their resources. Analyzing the climate ecology of the ex-situ target areas could therefore provide a reference for breeding programs and seed transfer guidelines and policies. At the same time, we suggest that future research on the species also consider other factors along with climate, such as species interactions (allelopathy, soil nutrient competition), land-use change (bio-energy farmland expansion), and the influence of human activities 60,61, because these factors collectively affect the contemporary and future distribution of M. azedarach.

Conclusion
Here, we used three common model accuracy indicators to compare the suitability of seven data mining techniques for predicting M. azedarach distribution. The RF model, with its strong robustness and stability, provided the highest accuracy in establishing a climate niche model. Based on this model, maps of contemporary and future suitable habitats were developed. The RF predictions indicated that M. azedarach would benefit from future climate change through range expansion, with a tendency towards northward and westward expansion. To maximize the protection and development of the species, we recommend proactive in-situ conservation measures to conserve genetic variation for adaptation to uncertainty, ex-situ conservation to protect genetic resources at risk, and assisted migration to make better use of areas with good potential under future climates.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.